

# **FPGA AI DATAPATHS**

June 2018

## Density and Flexibility in Computation

Floating Point & Fixed Point

Integer – many different sizes – including asymmetric - mix and match

Floating Point – FP32 and now BFLOAT16

Mixed Representation – Floating Point without Floating Point resources

Structures – Individual MAC or DOT – of any size

Data Movement – 100% sustained to peak

Plus massive internal bandwidth



### FPGA 101

### Customers buy logic

### But they pay for routing



## Base function: 4/6 LUT + register

AND2 gate = XOR6 gate

But not really

Not enough routing – wire limited

**Registers are free**

Effective FPGA design is mapping to all of the logic available

And all of the wires

This is much more difficult





#### +100s Tbps local memory bandwidth



## **FPGA FLOATING POINT**

## Floating Point DSP Block

Area, aspect ratio, and interface largely dictate the DSP Block's relationship to the device

Wire density key

- Pitch match to architecture
- 2 major innovations
  - Support FP multiplier RNE inside integer datapath, while supporting 40 legacy integer modes
  - Vector Mode builds recursive reduction trees of any size





## **Direct Dot Product Support**

1. A,B,C...I,J all arrive at the same time
2. AB+CD, EF+GH, IJ+KL computed (Re-use systolic connections)
3. Using soft routing, first sum of products fed back to DSP Block inputs
4. Re-use sequential connections, calculate next level of tree
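The four steps can be read as a two-phase reduction: each DSP Block first forms a sum of two products, then the partial sums are combined level by level. The Python sketch below is a behavioral model of that reading only. The function names are hypothetical, and it does not model the systolic connections, soft routing, or timing of a real device.

```python
def dsp_sum_of_two_products(a, b, c, d):
    """Model one DSP Block computing a*b + c*d (step 2)."""
    return a * b + c * d


def dot_product_tree(xs, ws):
    """Dot product of xs and ws as a reduction tree.

    Returns (result, levels), where levels counts the sum-of-products
    stage plus every adder level needed to reach a single value.
    """
    assert len(xs) == len(ws) and len(xs) >= 2 and len(xs) % 2 == 0

    # Step 2: every pair of products is summed inside one modelled DSP Block.
    level = [
        dsp_sum_of_two_products(xs[i], ws[i], xs[i + 1], ws[i + 1])
        for i in range(0, len(xs), 2)
    ]
    levels = 1

    # Steps 3 and 4: feed partial sums back and compute the next tree level.
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        levels += 1

    return level[0], levels


# Twelve operands (A..L in the slide) give three sums of two products.
result, levels = dot_product_tree([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12])
```

With six products the model needs three levels: one sum-of-products stage and two adder levels.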





## Yes Virginia, there is a BFLOAT16

Announced at Intel AI DEVCON in May 2018

All Intel products, including FPGA

What is BFLOAT16?

Introduced by Google February 2018

FP32 reduced to 16 bits

Truncate 16 Mantissa LSBs



Same dynamic range – good for vanishingly small numbers for ML training
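A minimal Python sketch of the truncation described above, assuming plain truncation with no rounding. The helper names are illustrative, not part of any Intel or Google API. Because the sign and 8-bit exponent are untouched, a value far below the FP16 range survives the conversion.

```python
import struct


def fp32_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep sign, 8 exponent bits, and 7 mantissa bits; drop the 16 LSBs."""
    return fp32_to_bits(x) >> 16


def from_bfloat16_bits(b: int) -> float:
    """Widen a BFLOAT16 pattern back to FP32 by appending 16 zero bits."""
    return bits_to_fp32((b & 0xFFFF) << 16)


tiny = 1e-30  # representable in FP32, far below the FP16 range

# BFLOAT16 keeps the FP32 exponent, so the value stays non-zero.
roundtrip = from_bfloat16_bits(to_bfloat16_bits(tiny))

# IEEE 754 half precision (struct format "e") underflows to zero instead.
fp16_roundtrip = struct.unpack("<e", struct.pack("<e", tiny))[0]
```

The cost is precision: only 7 stored mantissa bits remain, so the relative error of the truncated value is below 2^-7.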



# FPGA FLOATING POINT (WITHOUT FLOATING POINT)

## **Floating Point Compiler**

Automatically extracts inter-operator redundancy in a group of floating point operators

Typically 50% area reduction

Typically 50% latency reduction

FPGA floating point system design (soft logic based) becomes possible

Single, Double, or Custom Precision (BFP16)

Mixed precisions can be directly mixed with optimized CAST() operators



## FP without FP

Polynomials can have many operators

Expensive - power, area, and latency

Most polynomials have monotonic relationship between terms

First observation: normalization and denormalization are redundant

If relationship between terms is known, all shifts can be pre-computed monotonically

FPGAs filled with small ROMs (6LUTs)
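One way to read the observation above: if the magnitude of each polynomial term is known in advance, the alignment shift for each term is a constant, so per-operation normalization can be dropped. The Python sketch below illustrates that reading with integer arithmetic only. It is not the presenter's method; `FRAC_BITS`, `mant_bits`, and all function names are assumptions, and the input is assumed to satisfy 0 <= x < 1.

```python
from math import frexp

FRAC_BITS = 24  # assumed fractional width of the shared accumulator


def compile_coeffs(coeffs, mant_bits=16):
    """'Compile time': turn each coefficient into (integer mantissa, static shift)."""
    table = []
    for c in coeffs:
        m, e = frexp(c)                      # c == m * 2**e, 0.5 <= |m| < 1
        mant = round(m * (1 << mant_bits))   # integer mantissa
        shift = FRAC_BITS + e - mant_bits    # fixed alignment, known up front
        table.append((mant, shift))
    return table


def apply_shift(v, s):
    """Shift left for s >= 0, right otherwise; s is a constant per term."""
    return v << s if s >= 0 else v >> -s


def eval_poly_fixed(table, x_fixed):
    """Horner evaluation using only integer multiply, add, and constant shifts.

    x_fixed is x scaled by 2**FRAC_BITS.
    """
    acc = 0
    for mant, shift in reversed(table):
        acc = ((acc * x_fixed) >> FRAC_BITS) + apply_shift(mant, shift)
    return acc


# Example: 1 + x + x^2/2 + x^3/6 + x^4/24 evaluated at x = 0.5
table = compile_coeffs([1.0, 1.0, 0.5, 1 / 6, 1 / 24])
y = eval_poly_fixed(table, 1 << (FRAC_BITS - 1)) / (1 << FRAC_BITS)
```

In hardware terms, each `(mant, shift)` pair is a constant that could live in a small ROM or in wiring, and no normalization logic sits between operators.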



# **INTEGER - SOFT**

## **Paper Review**

Use 100% of logic density

More importantly, use 100% of routing density

Independent vs. redundant connections

Refactor to greater than 100% logic density

Use Out-of-band functions

Collapse to single logic level if possible



## Soft Logic Multiplier for Free?

3x3 Multiplier 3 ALMs

ALM – two arithmetic output bits

3 ALMs = 6 outputs = min. 3x3

No point in putting 3x3 multipliers in hard logic

System Cost (routing, logic, power, latency) greater than using inline
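The bit count behind "3 ALMs = 6 outputs" can be checked with a small Python model: a 3x3 product has at most 6 bits, and each product bit is a function of the same 6 input bits, so it fits a 64-entry truth table. This is a counting sketch under that assumption only. It does not model how the functions pack into real ALMs or their carry chains, and the function names are hypothetical.

```python
def build_mult3x3_luts():
    """Build one 64-entry truth table per product bit.

    The table index is (a << 3) | b for 3-bit unsigned a and b.
    """
    luts = [[0] * 64 for _ in range(6)]
    for a in range(8):
        for b in range(8):
            p = a * b  # at most 7 * 7 = 49, which fits in 6 bits
            for bit in range(6):
                luts[bit][(a << 3) | b] = (p >> bit) & 1
    return luts


def mult3x3(luts, a, b):
    """Multiply by table lookup only: one lookup per output bit."""
    idx = (a << 3) | b
    return sum(luts[bit][idx] << bit for bit in range(6))


luts = build_mult3x3_luts()
product = mult3x3(luts, 5, 7)
```

Six single-level lookups reproduce the full multiplier, which matches the slide's count of two outputs per ALM across three ALMs.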

### Expand to 4x4 and larger






## **Subset Multiplier Extraction**

I thought you just said soft logic multipliers were free?

Not so for BFLOAT16 or near BFLOAT (15, 14, etc.)

Multiple 6x6, 7x7 – or asymmetric such as 6x7

Can also implement adder tree or portions of it in DSP Block

Makes sense if datapath is physically placed near DSP Block

DSP Block needs to be inline

Mixture of hard and soft logic possible



# **ELEMENTARY FUNCTIONS**

## **RNN and Hyperbolics**



Reducing tanh(x) latency by 50% = 70% performance increase!



## Hyperbolic Construction





## **REAL WORLD APPLICATION**

## **Microsoft Brainwave**

### ISCA 2018 Paper

"96,000 multiply-accumulate units"

"287 GFLOPs/W"

"can run all DeepBench layers at under 4ms at batch 1"

"..23% to 75% of peak FLOPs for medium to large LSTM/GRUs (>1500 dimension)"

#### A Configurable Cloud-Scale DNN Processor for Real-Time AI

Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, Doug Burger

Microsoft

Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates.

This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.

Index Terms: neural network hardware; accelerator architectures; field programmable gate arrays

#### I. INTRODUCTION

Hardware acceleration of deep neural networks (DNNs) is becoming commonplace as the computational complexity of DNN models has grown. Compared to general-purpose CPUs, accelerators reduce both cost and latency for training and serving leading-edge models. Fortunately, the high level of parallelism available in DNN models makes them amenable to silicon acceleration. With evolving DNN-specific features, GPGPUs have been particularly successful at accelerating DNN workloads. In addition, a Cambrian explosion of new Neural Processing Unit (NPU) architectures is taking place, driven by academic researchers, startups, and large companies. Training and inference (evaluating a trained model) have different requirements, however. Training is primarily a throughput-bound workload and insensitive to the latency of processing a single sample. Inference, on the other hand, can be much more latency sensitive. DNNs are increasingly used in live, interactive services, such as web search, advertising, interactive speech, and real-time video (e.g., for self-driving cars), where low latency is required to provide smooth user experiences, satisfy service-level agreements (SLAs), and/or meet safety requirements.

2575-713X/18/\$31.00 ©2018 IEEE, DOI 10.1109/ISCA.2018.00012

Highly parallel architectures with deep pipelines, such as GPGPUs, achieve high throughput on DNN models by batching evaluations, exploiting parallelism both within and across requests. This approach works well for offline training, where the training data set can be partitioned into "minibatches", increasing throughput without significantly impacting convergence. However, systems optimized for batch throughput typically can apply only a fraction of their resources to a single request. In an online inference setting, requests often arrive one at a time; a throughput architecture must either process these requests individually, leading to reduced throughput while still sustaining batch-equivalent latency, or incur increased latency by waiting for multiple request arrivals to form a batch.

We have developed a full-system architecture for DNN inference that uses a different approach [1], [2]. Rather than driving up throughput at the expense of latency by exploiting inter-request parallelism, the system reduces latency by extracting as much parallelism as possible from individual requests. We do not sacrifice throughput but achieve it as the direct result of low single-request latency. We use the term "real-time AI" to describe DNN inference with no batching. This system, called Project Brainwave (BW for short), achieves much lower latencies than equivalent technologies such as GPGPUs on a watt-for-watt basis, with competitive throughput.

This paper details the architecture and microarchitecture of the BW NPU, which is at the heart of the BW system. In its current form, the BW NPU is a DNN-optimized "soft processor" synthesized onto FPGAs. Despite the lower clock rate and higher area overheads that FPGAs incur, the BW NPU achieves record-setting performance for real-time AI, sustaining 35 Teraflops on large RNN benchmarks with no batching. However, only one of the techniques that the BW NPU uses to achieve low latency on individual DNN requests is tied to reconfigurable logic, and the rest could be applied to a "hard NPU" with a higher clock rate but reduced flexibility.


## Brainwave Floorplan



#### Source: Microsoft Presentation, HotChips2017

Programmable Solutions Group



# CONCLUSIONS



FPGAs can mix IEEE754 FP, custom FP, integer, and combination of numerics simultaneously

Elementary functions – multiple different numerics internally

Can change this from algorithm to algorithm, with multiple different configurations

Very high internal bandwidth and unlimited configurability in connectivity

### **Computational Density and Flexibility**



